Drug addiction has long been hypothesized to be associated with a person's psychology. Popular psychometric measures such as the NEO-FFI-R (neuroticism, extraversion, openness to experience, agreeableness, and conscientiousness), the BIS-11 (impulsivity), and the ImpSS (sensation seeking) are meant to model a person's personality traits. These measurements, along with controls such as age, gender, education, country, and ethnicity, are collected to model several indicators of substance abuse for different drugs.
A number of studies have attempted to illustrate the relationship between personality traits and substance abuse. For example, Roncero et al. noted the relationship between high N and cocaine-induced drug consumption. Another study, by Vollrath & Torgersen, noted that a low C score and a high N or E score correlate strongly with hazardous health behaviors. Since substance abuse has long been considered a pressing economic problem, we could attempt to model the relationships between the covariates and the indicators of substance abuse so that policies can better target specific areas and maximize payoffs.
Rather than coding the categorical variables as factors, the dataset quantifies them: each category is mapped to a real value so that distances between categories are meaningful in a metric space, which is why these variables appear as numeric.
A summary of the variables and their respective codes is as follows:
| Attributes | Description |
|---|---|
| age | age of participant |
| gender | gender of participant |
| education | level of education |
| country | country of current residence of participant |
| ethnicity | ethnicity of participant |
| Nscore | NEO-FFI-R Neuroticism |
| Escore | NEO-FFI-R Extraversion |
| Oscore | NEO-FFI-R Openness to experience |
| Ascore | NEO-FFI-R Agreeableness |
| Cscore | NEO-FFI-R Conscientiousness |
| Impulsive | impulsiveness measured by BIS-11 |
| SS | sensation seeking measured by ImpSS |
| alcohol | class of alcohol consumption |
| amphet | class of amphetamines consumption |
| amyl | class of amyl nitrite consumption |
| benzo | class of benzodiazepine consumption |
| caff | class of caffeine consumption |
| cannabis | class of cannabis consumption |
| choc | class of chocolate consumption |
| coke | class of cocaine consumption |
| crack | class of crack consumption |
| ecstasy | class of ecstasy consumption |
| heroin | class of heroin consumption |
| ketamine | class of ketamine consumption |
| legalh | class of legal highs consumption |
| lsd | class of lsd consumption |
| meth | class of methadone consumption |
| mushroom | class of magic mushrooms consumption |
| nicotine | class of nicotine consumption |
| semer | class of fictitious drug Semeron consumption |
| vsa | class of volatile substance abuse consumption |

age

| Value | Meaning |
|---|---|
| -0.95197 | 18-24 |
| -0.07854 | 25-34 |
| 0.49788 | 35-44 |
| 1.09449 | 45-54 |
| 1.82213 | 55-64 |
| 2.59171 | 65+ |

gender

| Value | Meaning |
|---|---|
| 0.48246 | Female |
| -0.48246 | Male |

education

| Value | Meaning |
|---|---|
| -2.43591 | Left school before 16 years |
| -1.73790 | Left school at 16 years |
| -1.43719 | Left school at 17 years |
| -1.22751 | Left school at 18 years |
| -0.61113 | Some college or university, no certificate or degree |
| -0.05921 | Professional certificate/ diploma |
| 0.45468 | University degree |
| 1.16365 | Masters degree |
| 1.98437 | Doctorate degree |

country

| Value | Meaning |
|---|---|
| -0.09765 | Australia |
| 0.24923 | Canada |
| -0.46841 | New Zealand |
| -0.28519 | Other |
| 0.21128 | Republic of Ireland |
| 0.96082 | UK |
| -0.57009 | USA |

ethnicity

| Value | Meaning |
|---|---|
| -0.50212 | Asian |
| -1.10702 | Black |
| 1.90725 | Mixed-Black/Asian |
| 0.12600 | Mixed-White/Asian |
| -0.22166 | Mixed-White/Black |
| 0.11440 | Other |
| -0.31685 | White |
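
Because the categorical covariates arrive as real numbers, it can help to map them back to readable labels when inspecting the data. The following is a minimal sketch for the age variable, using the codes from the age table above. The helper `decode_scaled()` and the example vector are hypothetical names introduced here for illustration; they are not part of the dataset or of any package.

```r
# Scaled codes and labels copied from the age table above
age_codes  <- c(-0.95197, -0.07854, 0.49788, 1.09449, 1.82213, 2.59171)
age_labels <- c("18-24", "25-34", "35-44", "45-54", "55-64", "65+")

# Hypothetical helper: map scaled values back to a factor of labels.
# Values that match no code become NA.
decode_scaled <- function(x, codes, labels, ordered = FALSE) {
  # Round both sides so tiny floating-point differences do not break matching
  idx <- match(round(x, 5), round(codes, 5))
  factor(labels[idx], levels = labels, ordered = ordered)
}

# A few example values in the scaled coding
age_example <- c(0.49788, -0.07854, -0.95197, 2.59171)
decoded_age <- decode_scaled(age_example, age_codes, age_labels, ordered = TRUE)
print(decoded_age)
```

The same helper could be reused for gender, education, country, and ethnicity by passing the corresponding codes and labels; `ordered = TRUE` only makes sense for variables with a natural order, such as age and education.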
The personality scores are calculated on a “continuous” scale, so they are coded continuously, while the substances are coded as follows:
| Value | Meaning |
|---|---|
| CL0 | Never Used |
| CL1 | Used over a Decade Ago |
| CL2 | Used in Last Decade |
| CL3 | Used in Last Year |
| CL4 | Used in Last Month |
| CL5 | Used in Last Week |
| CL6 | Used in Last Day |
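
Since the classes run from least to most recent use, one reasonable way to work with them is as an ordered factor, with an integer version for numeric summaries. This is only a sketch on a small made-up vector; the variable names are illustrative assumptions.

```r
# Class codes and their meanings, taken from the table above
usage_codes  <- paste0("CL", 0:6)
usage_labels <- c("Never Used", "Used over a Decade Ago", "Used in Last Decade",
                  "Used in Last Year", "Used in Last Month", "Used in Last Week",
                  "Used in Last Day")

# Made-up example responses
x <- c("CL5", "CL2", "CL0", "CL6")

# Ordered factor: comparisons such as usage[1] > usage[2] are meaningful
usage <- factor(x, levels = usage_codes, labels = usage_labels, ordered = TRUE)

# Integer version (0 = never used, 6 = used in the last day)
usage_int <- as.integer(usage) - 1L

print(usage)
print(usage_int)
```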
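We could probe the dataset as follows:
# Reading the data from source (the first column is a record ID, so it is dropped).
# stringsAsFactors = TRUE keeps the substance columns as factors on R >= 4.0, as in the output below.
drugs = read.csv("https://archive.ics.uci.edu/ml/machine-learning-databases/00373/drug_consumption.data", header = FALSE, stringsAsFactors = TRUE)[-1]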
# Naming the columns in order of the description on source.
colnames(drugs) = c(
"age",
"gender",
"education",
"country",
"ethinicity",
"Nscore",
"Escore",
"Oscore",
"Ascore",
"Cscore",
"Impulsive",
"SS",
"alcohol",
"amphet",
"amyl",
"benzo",
"caff",
"cannabis",
"choc",
"coke",
"crack",
"ecstasy",
"heroin",
"ketamine",
"legalh",
"lsd",
"meth",
"mushroom",
"nicotine",
"semer",
"vsa"
)
# Dimension of dataset
dim(drugs)
## [1] 1885 31
# First five observations of the dataset
head(drugs, 5)
## age gender education country ethinicity Nscore Escore
## 1 0.49788 0.48246 -0.05921 0.96082 0.12600 0.31287 -0.57545
## 2 -0.07854 -0.48246 1.98437 0.96082 -0.31685 -0.67825 1.93886
## 3 0.49788 -0.48246 -0.05921 0.96082 -0.31685 -0.46725 0.80523
## 4 -0.95197 0.48246 1.16365 0.96082 -0.31685 -0.14882 -0.80615
## 5 0.49788 0.48246 1.98437 0.96082 -0.31685 0.73545 -1.63340
## Oscore Ascore Cscore Impulsive SS alcohol amphet amyl benzo
## 1 -0.58331 -0.91699 -0.00665 -0.21712 -1.18084 CL5 CL2 CL0 CL2
## 2 1.43533 0.76096 -0.14277 -0.71126 -0.21575 CL5 CL2 CL2 CL0
## 3 -0.84732 -1.62090 -1.01450 -1.37983 0.40148 CL6 CL0 CL0 CL0
## 4 -0.01928 0.59042 0.58489 -1.37983 -1.18084 CL4 CL0 CL0 CL3
## 5 -0.45174 -0.30172 1.30612 -0.21712 -0.21575 CL4 CL1 CL1 CL0
## caff cannabis choc coke crack ecstasy heroin ketamine legalh lsd meth
## 1 CL6 CL0 CL5 CL0 CL0 CL0 CL0 CL0 CL0 CL0 CL0
## 2 CL6 CL4 CL6 CL3 CL0 CL4 CL0 CL2 CL0 CL2 CL3
## 3 CL6 CL3 CL4 CL0 CL0 CL0 CL0 CL0 CL0 CL0 CL0
## 4 CL5 CL2 CL4 CL2 CL0 CL0 CL0 CL2 CL0 CL0 CL0
## 5 CL6 CL3 CL6 CL0 CL0 CL1 CL0 CL0 CL1 CL0 CL0
## mushroom nicotine semer vsa
## 1 CL0 CL2 CL0 CL0
## 2 CL0 CL4 CL0 CL0
## 3 CL1 CL0 CL0 CL0
## 4 CL0 CL2 CL0 CL0
## 5 CL2 CL2 CL0 CL0
# Summary of the dataset
summary(drugs)
## age gender education
## Min. :-0.95197 Min. :-0.4824600 Min. :-2.435910
## 1st Qu.:-0.95197 1st Qu.:-0.4824600 1st Qu.:-0.611130
## Median :-0.07854 Median :-0.4824600 Median :-0.059210
## Mean : 0.03461 Mean :-0.0002559 Mean :-0.003806
## 3rd Qu.: 0.49788 3rd Qu.: 0.4824600 3rd Qu.: 0.454680
## Max. : 2.59171 Max. : 0.4824600 Max. : 1.984370
##
## country ethinicity Nscore
## Min. :-0.5701 Min. :-1.1070 Min. :-3.464360
## 1st Qu.:-0.5701 1st Qu.:-0.3169 1st Qu.:-0.678250
## Median : 0.9608 Median :-0.3169 Median : 0.042570
## Mean : 0.3555 Mean :-0.3096 Mean : 0.000047
## 3rd Qu.: 0.9608 3rd Qu.:-0.3169 3rd Qu.: 0.629670
## Max. : 0.9608 Max. : 1.9072 Max. : 3.273930
##
## Escore Oscore Ascore
## Min. :-3.273930 Min. :-3.273930 Min. :-3.464360
## 1st Qu.:-0.695090 1st Qu.:-0.717270 1st Qu.:-0.606330
## Median : 0.003320 Median :-0.019280 Median :-0.017290
## Mean :-0.000163 Mean :-0.000534 Mean :-0.000245
## 3rd Qu.: 0.637790 3rd Qu.: 0.723300 3rd Qu.: 0.760960
## Max. : 3.273930 Max. : 2.901610 Max. : 3.464360
##
## Cscore Impulsive SS alcohol
## Min. :-3.464360 Min. :-2.555240 Min. :-2.078480 CL0: 34
## 1st Qu.:-0.652530 1st Qu.:-0.711260 1st Qu.:-0.525930 CL1: 34
## Median :-0.006650 Median :-0.217120 Median : 0.079870 CL2: 68
## Mean :-0.000386 Mean : 0.007216 Mean :-0.003292 CL3:198
## 3rd Qu.: 0.584890 3rd Qu.: 0.529750 3rd Qu.: 0.765400 CL4:287
## Max. : 3.464360 Max. : 2.901610 Max. : 1.921730 CL5:759
## CL6:505
## amphet amyl benzo caff cannabis choc coke
## CL0:976 CL0:1305 CL0:1000 CL0: 27 CL0:413 CL0: 32 CL0:1038
## CL1:230 CL1: 210 CL1: 116 CL1: 10 CL1:207 CL1: 3 CL1: 160
## CL2:243 CL2: 237 CL2: 234 CL2: 24 CL2:266 CL2: 10 CL2: 270
## CL3:198 CL3: 92 CL3: 236 CL3: 60 CL3:211 CL3: 54 CL3: 258
## CL4: 75 CL4: 24 CL4: 120 CL4: 106 CL4:140 CL4:296 CL4: 99
## CL5: 61 CL5: 14 CL5: 84 CL5: 273 CL5:185 CL5:683 CL5: 41
## CL6:102 CL6: 3 CL6: 95 CL6:1385 CL6:463 CL6:807 CL6: 19
## crack ecstasy heroin ketamine legalh lsd
## CL0:1627 CL0:1021 CL0:1605 CL0:1490 CL0:1094 CL0:1069
## CL1: 67 CL1: 113 CL1: 68 CL1: 45 CL1: 29 CL1: 259
## CL2: 112 CL2: 234 CL2: 94 CL2: 142 CL2: 198 CL2: 177
## CL3: 59 CL3: 277 CL3: 65 CL3: 129 CL3: 323 CL3: 214
## CL4: 9 CL4: 156 CL4: 24 CL4: 42 CL4: 110 CL4: 97
## CL5: 9 CL5: 63 CL5: 16 CL5: 33 CL5: 64 CL5: 56
## CL6: 2 CL6: 21 CL6: 13 CL6: 4 CL6: 67 CL6: 13
## meth mushroom nicotine semer vsa
## CL0:1429 CL0:982 CL0:428 CL0:1877 CL0:1455
## CL1: 39 CL1:209 CL1:193 CL1: 2 CL1: 200
## CL2: 97 CL2:260 CL2:204 CL2: 3 CL2: 135
## CL3: 149 CL3:275 CL3:185 CL3: 2 CL3: 61
## CL4: 50 CL4:115 CL4:108 CL4: 1 CL4: 13
## CL5: 48 CL5: 40 CL5:157 CL5: 14
## CL6: 73 CL6: 4 CL6:610 CL6: 7
# Structure of the dataset
str(drugs)
## 'data.frame': 1885 obs. of 31 variables:
## $ age : num 0.4979 -0.0785 0.4979 -0.952 0.4979 ...
## $ gender : num 0.482 -0.482 -0.482 0.482 0.482 ...
## $ education : num -0.0592 1.9844 -0.0592 1.1637 1.9844 ...
## $ country : num 0.961 0.961 0.961 0.961 0.961 ...
## $ ethinicity: num 0.126 -0.317 -0.317 -0.317 -0.317 ...
## $ Nscore : num 0.313 -0.678 -0.467 -0.149 0.735 ...
## $ Escore : num -0.575 1.939 0.805 -0.806 -1.633 ...
## $ Oscore : num -0.5833 1.4353 -0.8473 -0.0193 -0.4517 ...
## $ Ascore : num -0.917 0.761 -1.621 0.59 -0.302 ...
## $ Cscore : num -0.00665 -0.14277 -1.0145 0.58489 1.30612 ...
## $ Impulsive : num -0.217 -0.711 -1.38 -1.38 -0.217 ...
## $ SS : num -1.181 -0.216 0.401 -1.181 -0.216 ...
## $ alcohol : Factor w/ 7 levels "CL0","CL1","CL2",..: 6 6 7 5 5 3 7 6 5 7 ...
## $ amphet : Factor w/ 7 levels "CL0","CL1","CL2",..: 3 3 1 1 2 1 1 1 1 2 ...
## $ amyl : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 3 1 1 2 1 1 1 1 1 ...
## $ benzo : Factor w/ 7 levels "CL0","CL1","CL2",..: 3 1 1 4 1 1 1 1 1 2 ...
## $ caff : Factor w/ 7 levels "CL0","CL1","CL2",..: 7 7 7 6 7 7 7 7 7 7 ...
## $ cannabis : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 5 4 3 4 1 2 1 1 2 ...
## $ choc : Factor w/ 7 levels "CL0","CL1","CL2",..: 6 7 5 5 7 5 6 5 7 7 ...
## $ coke : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 4 1 3 1 1 1 1 1 1 ...
## $ crack : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ecstasy : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 5 1 1 2 1 1 1 1 1 ...
## $ heroin : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ ketamine : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 3 1 3 1 1 1 1 1 1 ...
## $ legalh : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 1 1 1 2 1 1 1 1 1 ...
## $ lsd : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 3 1 1 1 1 1 1 1 1 ...
## $ meth : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 4 1 1 1 1 1 1 1 1 ...
## $ mushroom : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 1 2 1 3 1 1 1 1 1 ...
## $ nicotine : Factor w/ 7 levels "CL0","CL1","CL2",..: 3 5 1 3 3 7 7 1 7 7 ...
## $ semer : Factor w/ 5 levels "CL0","CL1","CL2",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ vsa : Factor w/ 7 levels "CL0","CL1","CL2",..: 1 1 1 1 1 1 1 1 1 1 ...
Now, after that brief introduction, we are ready to examine our data more closely.
pkg_list = c("tidyverse", "corrplot", "gridExtra")
mia_pkgs = pkg_list[!(pkg_list %in% installed.packages()[,"Package"])]
if(length(mia_pkgs) > 0) install.packages(mia_pkgs)
loaded_pkgs = lapply(pkg_list, require, character.only=TRUE)
Correlation matrix:
# correlation between covariates
corr = cor(drugs[1:12]) %>% round(1)
corrplot(corr, method = "color", addCoef.col = "black", type = "upper")
We notice that no pair of covariates is strongly correlated, so pairwise collinearity is unlikely to be a concern.
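
Pairwise correlations do not fully rule out multicollinearity, because one covariate can be close to a linear combination of several others. A common complementary check is the variance inflation factor (VIF), which can be read off the diagonal of the inverse correlation matrix. The sketch below uses simulated stand-in data (the variables `x1`, `x2`, `x3` are assumptions, not columns of the drug dataset) in which `x3` is deliberately constructed from the other two.

```r
set.seed(1)
n  <- 500
x1 <- rnorm(n)
x2 <- rnorm(n)
# x3 is almost a linear combination of x1 and x2 (deliberately collinear)
x3 <- x1 + x2 + rnorm(n, sd = 0.1)
X  <- cbind(x1, x2, x3)

# VIF_j equals the j-th diagonal element of the inverse correlation matrix
vif <- diag(solve(cor(X)))
print(round(vif, 1))

# A common rule of thumb flags VIF values above 5 or 10
print(vif > 10)
```

With the real data, the same two lines could be applied to `drugs[1:12]` in place of `X`.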
Now we will look at the correlation between the potential dependent variables and the independent variables. The data contain 19 dependent variables (columns 13 to 31), each with up to seven classes (CL0 to CL6). Because correlations are hard to visualize for factor variables, we extract the digit from each class label, treat it as a number, and use a correlogram as before. This mirrors the covariates, which were themselves categorical before being quantified.
y = lapply(drugs[13:31], as.character) %>%
lapply(., FUN = str_extract, pattern = "[:digit:]") %>%
lapply(., as.numeric) %>%
do.call(cbind, .)
corr2 = round(cor(drugs[1:12], y), 1)
corrplot(corr2, method = "color", addCoef.col = "black")
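
Because the class digits are ordinal rather than interval-scaled, a rank-based (Spearman) correlation is a natural companion to the Pearson correlation used above. The following sketch compares the two on simulated data; the score, the latent variable, and the cut points are all assumptions made for illustration.

```r
set.seed(42)
n <- 200

# Simulated stand-in for a personality score
Oscore <- rnorm(n)

# Simulated ordinal response: a noisy latent variable cut into 7 classes
latent <- 0.6 * Oscore + rnorm(n)
breaks <- c(-Inf, quantile(latent, probs = (1:6) / 7), Inf)
cls    <- cut(latent, breaks = breaks, labels = paste0("CL", 0:6))

# Convert "CL3" -> 3, as the digit extraction above does
cls_num <- as.numeric(sub("CL", "", as.character(cls)))

pearson  <- cor(Oscore, cls_num)
spearman <- cor(Oscore, cls_num, method = "spearman")
print(c(pearson = pearson, spearman = spearman))
```

On the real data, `cor(drugs[1:12], y, method = "spearman")` would give the rank-based version of the second correlogram.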
Scatter plots:
Now we will plot nine of the most strongly correlated pairs.
y.new = drugs[, c("cannabis", "lsd", "mushroom")]
x.new = drugs[, c("country", "Oscore", "SS")]
drug.new = cbind.data.frame(x.new, y.new)
# Plotting y variables against x variables (with geom_jitter to see density at each point)
count = 0
for(i in 1:ncol(x.new)){
for(j in 1:ncol(y.new)){
count = count+1
assign(paste0("p", count),
ggplot(drug.new, aes_string(x = names(x.new[i]), y = names(y.new[j]))) +
theme_bw() +
geom_jitter(color = "darkblue", alpha = 0.3) +
labs(title = paste(names(y.new[j]), "vs.", names(x.new[i])))
)
}
}
# Printing the plots
mget(paste0("p", 1:9))
[Figures p1 to p9: jittered scatter plots of cannabis, lsd, and mushroom against country, Oscore, and SS]
As we can see, the relationships look unusual because of how the dataset is coded. Although these pairs are among the most correlated, the responses and some covariates (such as country) take only a handful of discrete values. For this reason, we will need variable selection algorithms that perform well even when the variables are not continuous.
Density plots:
If we look at the density plot for each personality score, we find that each of them approximates a normal distribution.
# Extracting the personality scores
scores = drugs[6:12]
for(i in 1:ncol(scores)){
assign( paste0("score", i),
ggplot(drugs, aes_string(x = names(scores[i]))) +
geom_density(color = "darkblue") +
theme_bw() +
scale_y_continuous(labels = function(x) paste0(x*100, "%")) +
labs(title = names(scores[i]))
)
}
# Plotting
grid.arrange(score1, score2, score3, score4, score5, score6, score7,
nrow = 2)
If we assume the scores are independent and normally distributed, then \(\sum_{i= 1}^{7} Z_i \sim N(\sum_{i=1}^{7} \mu_i, \sum_{i=1}^{7} \sigma_i^2)\), which could prove useful for hypothesis testing of parametric models or for creating another normal variable.
We can also see that the scores have already been standardized: the means are close to 0 and the variances are close to 1.
# Mean
(mu = colMeans(scores))
## Nscore Escore Oscore Ascore Cscore
## 4.660477e-05 -1.628011e-04 -5.343979e-04 -2.449655e-04 -3.860690e-04
## Impulsive SS
## 7.216064e-03 -3.291666e-03
# Variance (stored as `sigma`, but these are variances, not standard deviations)
(sigma = apply(scores, 2, var))
## Nscore Escore Oscore Ascore Cscore Impulsive SS
## 0.9962155 0.9949035 0.9924713 0.9948875 0.9950514 0.9109452 0.9287199
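
As a worked example of the sum-of-normals formula, the means and variances printed above can be plugged in directly. This is a sketch under the stated assumptions of independence and normality; the vector names `mu` and `sigma2` are chosen here for illustration. If the scores are correlated, the variance of the sum also includes the covariance terms, so the figure below should be read as an approximation.

```r
# Means and variances copied from the output above
mu <- c(Nscore = 4.660477e-05, Escore = -1.628011e-04, Oscore = -5.343979e-04,
        Ascore = -2.449655e-04, Cscore = -3.860690e-04,
        Impulsive = 7.216064e-03, SS = -3.291666e-03)
sigma2 <- c(Nscore = 0.9962155, Escore = 0.9949035, Oscore = 0.9924713,
            Ascore = 0.9948875, Cscore = 0.9950514,
            Impulsive = 0.9109452, SS = 0.9287199)

# Parameters of the sum under independence: means add, variances add
sum_mean <- sum(mu)
sum_var  <- sum(sigma2)
sum_sd   <- sqrt(sum_var)

# Central 95% interval of the summed score under the normal assumption
interval <- qnorm(c(0.025, 0.975), mean = sum_mean, sd = sum_sd)

print(c(mean = sum_mean, variance = sum_var))
print(interval)

# With the real data frame, the independence assumption could be checked by
# comparing sum_var against var(rowSums(scores)).
```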
Now, if we investigate the dependent variables, we can uncover a few interesting pieces of information.
# Changing from wide to long format
long = drugs %>%
gather(type, class, alcohol:vsa)
# Plotting frequencies by substance
ggplot(long, aes(x = class, y = ..count.., fill = type)) +
geom_bar() +
facet_wrap(~type) +
theme_bw() +
theme(axis.text.x = element_text(angle = 90))
Since the classes are coded as CL0 for “Never Used” and CL1 to CL6 for “Has Been Used At Least Once”, we can apply the following transformation:
# Code "CL0" as 0 and the rest as 1
substance.binary = apply(drugs[,13:31], 2,
FUN = function(i) ifelse(i == "CL0", yes = "0", no = "1")) %>%
data.frame
# Changing from wide to long format
long2 = substance.binary %>%
gather(type, class, alcohol:vsa)
# Saving the binary code in a dataset
drugs.binary = drugs %>% .[-(13:31)] %>% cbind.data.frame(., substance.binary)
# Plotting
ggplot(long2, aes(x = class, y = ..count.., fill = type)) +
geom_bar() +
facet_wrap(~type) +
theme_bw()
We can see that almost all drugs have “takers”, except Semeron, which has far more “never-takers” than any other substance. This is expected: page 4 of the data description indicates that Semeron is a fictitious drug meant to identify over-claimers.
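
To put numbers on this, the “Never Used” (CL0) counts from the summary output can be turned into shares of the 1885 respondents. The sketch below copies a handful of those counts by hand; the vector names are illustrative, and only a subset of substances is included.

```r
n_total <- 1885

# CL0 ("Never Used") counts copied from the summary output for a few substances
never_used <- c(alcohol = 34, caff = 27, choc = 32, cannabis = 413,
                nicotine = 428, heroin = 1605, crack = 1627, semer = 1877)

share_never <- never_used / n_total
share_ever  <- 1 - share_never

# Share of respondents who report ever using each substance, lowest first
print(round(sort(share_ever), 3))
```

On the binary data frame built above, `colMeans(substance.binary == "1")` would give the same shares for all 19 substances at once.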
Evidently, this is a classification problem that we could attempt to solve using various machine learning techniques. Some objectives that we could pursue are: